Details of this dataset

Gambia_Indie-MH: 4 plates prepped between 2023_03 and 2023_04

Import data

Sample size and number of loci BEFORE cleanup

Explore data structure

## [1] "Sample size = 346"
## [1] "Loci = 244"
## [1] "1 to 130 alleles per locus per sample"
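
The three summary lines above can be reproduced from a long-format allele table. A minimal sketch in Python, assuming one row per sample, locus and allele; the row layout and the toy values are hypothetical and are not taken from this dataset:

```python
from collections import defaultdict

# Hypothetical long-format rows: (sampleID, locus, allele, reads)
rows = [
    ("S1", "locusA", "a1", 120),
    ("S1", "locusA", "a2", 30),
    ("S1", "locusB", "b1", 200),
    ("S2", "locusA", "a1", 95),
]

def summarize(rows):
    """Return (n_samples, n_loci, min_alleles, max_alleles) per locus per sample."""
    samples = {r[0] for r in rows}
    loci = {r[1] for r in rows}
    alleles = defaultdict(set)
    for sample, locus, allele, _reads in rows:
        alleles[(sample, locus)].add(allele)
    counts = [len(v) for v in alleles.values()]
    return len(samples), len(loci), min(counts), max(counts)

n_samples, n_loci, fewest, most = summarize(rows)
print(f"Sample size = {n_samples}")
print(f"Loci = {n_loci}")
print(f"{fewest} to {most} alleles per locus per sample")
```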

Merging data

Combine multiple sequencing runs

Combine with parasite density

Combine with plate map
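
The three merging steps can be sketched as follows. This assumes that reads for the same sample, locus and allele are summed across runs, and that parasite density and plate position are keyed by sampleID; the report does not state how runs are combined, and all names and values here are hypothetical:

```python
from collections import Counter

# Hypothetical inputs: two sequencing runs, a density table and a plate map
run1 = [("S1", "locusA", "a1", 100), ("S2", "locusA", "a1", 40)]
run2 = [("S1", "locusA", "a1", 50)]
density = {"S1": 1200.0}                      # parasites per microliter
plate_map = {"S1": ("P1", "A1"), "S2": ("P1", "A2")}

def combine_runs(*runs):
    """Sum reads for identical (sample, locus, allele) keys across runs."""
    total = Counter()
    for run in runs:
        for sample, locus, allele, reads in run:
            total[(sample, locus, allele)] += reads
    return total

def annotate(total, density, plate_map):
    """Left-join density and plate position onto the combined reads."""
    out = []
    for (sample, locus, allele), reads in sorted(total.items()):
        plate, well = plate_map.get(sample, (None, None))
        out.append({
            "sampleID": sample, "locus": locus, "allele": allele,
            "reads": reads, "density": density.get(sample),
            "plate": plate, "well": well,
        })
    return out

merged = annotate(combine_runs(run1, run2), density, plate_map)
for row in merged:
    print(row)
```

Samples without a density value keep `None` rather than being dropped, so missing metadata stays visible in later steps.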

Plate map

Inspect all samples and controls in their spatial arrangement

This is a heatmap of average reads per locus for each sample
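
The values behind such a heatmap can be laid out as a grid before plotting. A minimal sketch, assuming a 96-well plate (rows A to H, columns 1 to 12) and a hypothetical mapping from well to per-locus read counts:

```python
ROWS = "ABCDEFGH"
COLS = range(1, 13)

def mean_reads_per_locus(locus_reads):
    """Average reads per locus for one well; empty wells give 0."""
    if not locus_reads:
        return 0.0
    return sum(locus_reads.values()) / len(locus_reads)

def plate_grid(well_to_reads):
    """Return an 8 x 12 grid of mean reads per locus, in plate order."""
    return [
        [mean_reads_per_locus(well_to_reads.get(f"{r}{c}", {})) for c in COLS]
        for r in ROWS
    ]

# Hypothetical wells
wells = {"A1": {"locusA": 100, "locusB": 300}, "H12": {"locusA": 10}}
grid = plate_grid(wells)
for row_label, row in zip(ROWS, grid):
    print(row_label, " ".join(f"{v:6.0f}" for v in row))
```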

Track reads through filtering

Estimate the proportion of primer dimers

Plot legend: grey = input reads, colored = amplicon reads

## [1] "Percent dimers by sample prep date (in %, already multiplied by 100)"
## # A tibble: 4 × 2
##   prep_date  dimer_percent
##   <chr>              <dbl>
## 1 2023_03_06         0.549
## 2 2023_03_28         0.356
## 3 2023_03_30         0.458
## 4 2023_04_05         0.825
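
A percentage per prep date like the one above could be computed as sketched below. The definition used here is an assumption: dimer reads are taken to be input reads minus reads remaining after dimer removal, pooled over all samples from the same prep date. The field names `input` and `no_dimers` and the toy values are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-sample read counts
samples = [
    {"prep_date": "d1", "input": 10000, "no_dimers": 9950},
    {"prep_date": "d1", "input": 20000, "no_dimers": 19900},
    {"prep_date": "d2", "input": 5000, "no_dimers": 4900},
]

def dimer_percent_by_date(samples):
    """Percent of input reads lost as dimers, pooled per prep date."""
    input_reads = defaultdict(int)
    dimer_reads = defaultdict(int)
    for s in samples:
        input_reads[s["prep_date"]] += s["input"]
        dimer_reads[s["prep_date"]] += s["input"] - s["no_dimers"]
    return {
        d: 100 * dimer_reads[d] / input_reads[d]
        for d in sorted(input_reads)
        if input_reads[d] > 0
    }

result = dimer_percent_by_date(samples)
for date, pct in result.items():
    print(date, round(pct, 3))
```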

Negative controls

Check that the number of reads is low

Check if the reads found correspond to the pools you amplified (1A and 5)

## [1] "List of pool 1A/1B/2 loci with >100 reads (summed over all negative controls)"
## # A tibble: 25 × 5
##    locus                         reads pool  Category Reason to include (if dr…¹
##    <chr>                         <int> <chr> <chr>    <chr>                     
##  1 Pf3D7_13_v3-2814583-2814832-2   344 2     <NA>     HRP3 deletion             
##  2 Pf3D7_11_v3-1902041-1902278-2   290 2     <NA>     ch11                      
##  3 Pf3D7_13_v3-1726174-1726408-2   269 2     <NA>     <NA>                      
##  4 Pf3D7_12_v3-2092654-2092884-2   249 2     <NA>     100                       
##  5 Pf3D7_13_v3-1725124-1725365-2   195 2     <NA>     613                       
##  6 Pf3D7_13_v3-1725887-1726123-2   176 2     <NA>     <NA>                      
##  7 Pf3D7_13_v3-2770591-2770863-2   170 2     <NA>     HRP3 deletion             
##  8 Pf3D7_11_v3-1950431-1950680-2   169 2     <NA>     ch11                      
##  9 Pf3D7_14_v3-294503-294768-2     160 2     <NA>     cnv                       
## 10 Pf3D7_03_v3-221295-221532-2     159 2     <NA>     C-terminal                
## # ℹ 15 more rows
## # ℹ abbreviated name: ¹​`Reason to include (if drug resistance: aminoacids)`
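
The listing above sums reads per locus over all negative controls and keeps loci above 100 reads. A minimal sketch of that step, assuming the pool is the last hyphen-separated field of the locus name, as in the names shown in the table; the control names and read counts are hypothetical:

```python
from collections import Counter

def loci_above_threshold(neg_rows, threshold=100):
    """Sum reads per locus over negative controls; keep loci above threshold."""
    totals = Counter()
    for _sample, locus, reads in neg_rows:
        totals[locus] += reads
    flagged = [
        (locus, reads, locus.rsplit("-", 1)[1])   # (locus, reads, pool)
        for locus, reads in totals.items()
        if reads > threshold
    ]
    # Highest read count first, ties broken by locus name
    return sorted(flagged, key=lambda x: (-x[1], x[0]))

# Hypothetical negative-control rows: (sampleID, locus, reads)
neg_rows = [
    ("NEG1", "chrX-100-350-2", 80),
    ("NEG2", "chrX-100-350-2", 60),
    ("NEG1", "chrY-500-740-1A", 30),
]
flagged = loci_above_threshold(neg_rows)
for locus, reads, pool in flagged:
    print(locus, reads, pool)
```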

Positive controls

Using a known parasite strain at a known density

Alleles should be known, and every locus should be monoclonal

The presence of polyclonal loci suggests contamination => see plate map

Alleles with <10 reads or at <1% of the locus reads should be filtered out (both thresholds are arbitrary)
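
The allele filter and the monoclonal/polyclonal split can be sketched as below. The thresholds (10 reads, 1% of locus reads) are the arbitrary ones named above; computing the fraction against the unfiltered locus total is an assumption, and the allele names and counts are hypothetical:

```python
def filter_alleles(allele_reads, min_reads=10, min_frac=0.01):
    """Drop alleles with < min_reads reads or below min_frac of the locus total."""
    total = sum(allele_reads.values())
    if total == 0:
        return {}
    return {
        allele: n
        for allele, n in allele_reads.items()
        if n >= min_reads and n / total >= min_frac
    }

def clonality(allele_reads):
    """Classify a locus after filtering: absent, monoclonal or polyclonal."""
    kept = filter_alleles(allele_reads)
    if not kept:
        return "absent"
    return "monoclonal" if len(kept) == 1 else "polyclonal"

# Hypothetical positive-control loci
control = {
    "locusA": {"a1": 990, "a2": 9, "a3": 1},   # a2, a3 fall below 10 reads
    "locusB": {"b1": 500, "b2": 500},          # two well-supported alleles
    "locusC": {"c1": 5000, "c2": 40},          # c2 is below 1% of locus reads
}
calls = {locus: clonality(reads) for locus, reads in control.items()}
n_mono = sum(1 for c in calls.values() if c == "monoclonal")
print(calls)
print("Prop. of monoclonal loci =", round(n_mono / len(calls), 2))
```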

## [1] "Mean of total locus = 214, Mean prop. of monoclonal loci = 0.85, Mean prop. of polyclonal loci = 0.15"
## [1] "Pool 1A loci that are absent in positive controls:"
## [1] "PKNH_12_v2-198869-199113-1AB"      "PmUG01_12_v1-1397996-1398245-1AB" 
## [3] "PocGH01_12_v1-1106456-1106697-1AB" "PvP01_12_v1-1184983-1185208-1AB"

Look at total amplification

All pools combined

## 
##  1A 1AB  1B 1B2   2 
## 165   5  75   2  29
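
The table above appears to tabulate loci by pool. A minimal sketch of such a count, assuming the pool is encoded as the last hyphen-separated field of the locus name; the four locus names are taken from this report:

```python
from collections import Counter

def pool_of(locus):
    """Pool label = text after the last hyphen of the locus name (assumption)."""
    return locus.rsplit("-", 1)[1]

loci = [
    "Pf3D7_13_v3-2814583-2814832-2",
    "Pf3D7_11_v3-1902041-1902278-2",
    "PKNH_12_v2-198869-199113-1AB",
    "PvP01_12_v1-1184983-1185208-1AB",
]
counts = Counter(pool_of(locus) for locus in loci)
for pool in sorted(counts):
    print(pool, counts[pool])
```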

Species check

Controls should contain only P. falciparum (Pfal), not other species

Detection of non-Pfal species in samples suggests a mixed infection (to account for in downstream analysis)

These loci (pool 1AB) will be filtered out downstream
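
A species check can be sketched from the locus names alone, assuming the text before the first underscore is the reference genome prefix. The prefix-to-species map covers the prefixes seen in this report; the 10-read cutoff and the sample read counts are hypothetical:

```python
# Reference genome prefixes as they appear in the locus names of this report
SPECIES_BY_PREFIX = {
    "Pf3D7": "P. falciparum",
    "PKNH": "P. knowlesi",
    "PmUG01": "P. malariae",
    "PocGH01": "P. ovale curtisi",
    "PvP01": "P. vivax",
}

def species_of(locus):
    """Map a locus name to a species; unmapped prefixes return 'unknown'."""
    prefix = locus.split("_", 1)[0]
    return SPECIES_BY_PREFIX.get(prefix, "unknown")

def non_pfal_species(locus_reads, min_reads=10):
    """Species other than P. falciparum with at least min_reads at some locus."""
    return sorted({
        species_of(locus)
        for locus, reads in locus_reads.items()
        if reads >= min_reads and species_of(locus) != "P. falciparum"
    })

# Hypothetical read counts for one sample
sample = {
    "Pf3D7_13_v3-2814583-2814832-2": 850,
    "PvP01_12_v1-1184983-1185208-1AB": 120,
    "PKNH_12_v2-198869-199113-1AB": 2,
}
print(non_pfal_species(sample))
```

An unmapped prefix is reported as "unknown" rather than silently ignored, so unexpected loci remain visible.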

Look at individual pools

Pool 1A

## [1] "No reads in any samples for these loci, for 1A"
## # A tibble: 0 × 3
## # ℹ 3 variables: locus <chr>, pool <chr>, n <dbl>
## [1] "List of samples that do not have >75% loci with >100 reads, in pool 1A"
## # A tibble: 172 × 7
## # Groups:   sampleID [172]
##    sampleID                        pool  totreads   n50  n100 neg_control   norm
##    <chr>                           <chr>    <int> <int> <int> <lgl>        <dbl>
##  1 Gambia_Indie-1B_MH_1_AE_2023_0… 1A         235     0     0 FALSE       0     
##  2 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        1989     4     0 FALSE       0     
##  3 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        8057    61    18 FALSE       0.112 
##  4 Gambia_Indie-1B_MH_1_AE_2023_0… 1A         464     0     0 FALSE       0     
##  5 Gambia_Indie-1B_MH_1_AE_2023_0… 1A         989     0     0 FALSE       0     
##  6 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        2786     7     0 FALSE       0     
##  7 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        4984    28     6 FALSE       0.0375
##  8 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        7523    63    16 FALSE       0.1   
##  9 Gambia_Indie-1B_MH_1_AE_2023_0… 1A        9277    77    23 FALSE       0.144 
## 10 Gambia_Indie-1B_MH_1_AE_2023_0… 1A         698     0     0 FALSE       0     
## # ℹ 162 more rows
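
The per-sample columns above can be sketched as follows, assuming `n50` and `n100` count loci with more than 50 and more than 100 reads and `norm` is `n100` divided by the number of loci in the pool. The `norm` values shown are consistent with a denominator of 160 for pool 1A (for example 18 / 160 = 0.1125), but that reading is an inference, and the toy sample below is hypothetical:

```python
def pool_coverage(locus_reads, n_loci):
    """Summarize one sample in one pool: total reads, n50, n100, norm."""
    reads = list(locus_reads.values())
    n100 = sum(1 for n in reads if n > 100)
    return {
        "totreads": sum(reads),
        "n50": sum(1 for n in reads if n > 50),
        "n100": n100,
        "norm": n100 / n_loci,
    }

def fails_qc(coverage, min_frac=0.75):
    """A sample fails unless more than min_frac of loci have > 100 reads."""
    return not coverage["norm"] > min_frac

# Hypothetical sample in a pool of 4 loci
cov = pool_coverage({"l1": 150, "l2": 120, "l3": 60, "l4": 0}, n_loci=4)
print(cov, "fails QC:", fails_qc(cov))
```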

Pool 5

## 
## TRUE 
##   47

## [1] "Loci without any reads, pool 5"
## # A tibble: 0 × 16
## # ℹ 16 variables: locus-pool <chr>, Category <chr>, chr_malaria <chr>,
## #   ampInsert_start <dbl>, ampInsert_end <dbl>, GeneID <chr>, Gene <chr>,
## #   Reason to include (if drug resistance: aminoacids) <chr>,
## #   Amino acid range (*: partial coverage of that aminoacid) <chr>,
## #   pool5 <lgl>, amplicon_id <chr>, amplicon_start <dbl>, amplicon_end <dbl>,
## #   fwd_primer <chr>, rev_primer <chr>, n <dbl>
## [1] "List of samples that do not have >75% alleles with >100 reads, pool 5"
## # A tibble: 154 × 7
## # Groups:   sampleID [154]
##    sampleID                        pool  totreads   n50  n100 neg_control   norm
##    <chr>                           <chr>    <int> <int> <int> <lgl>        <dbl>
##  1 Gambia_Indie-1B_MH_1_AE_2023_0… 1B          86     0     0 FALSE       0     
##  2 Gambia_Indie-1B_MH_1_AE_2023_0… 1B         647     2     0 FALSE       0     
##  3 Gambia_Indie-1B_MH_1_AE_2023_0… 1B        3216    22    10 FALSE       0.227 
##  4 Gambia_Indie-1B_MH_1_AE_2023_0… 1B         138     0     0 FALSE       0     
##  5 Gambia_Indie-1B_MH_1_AE_2023_0… 1B         364     0     0 FALSE       0     
##  6 Gambia_Indie-1B_MH_1_AE_2023_0… 1B         862     2     0 FALSE       0     
##  7 Gambia_Indie-1B_MH_1_AE_2023_0… 1B        1550    14     3 FALSE       0.0682
##  8 Gambia_Indie-1B_MH_1_AE_2023_0… 1B        2037    18     5 FALSE       0.114 
##  9 Gambia_Indie-1B_MH_1_AE_2023_0… 1B        3295    24    15 FALSE       0.341 
## 10 Gambia_Indie-1B_MH_1_AE_2023_0… 1B         266     0     0 FALSE       0     
## # ℹ 144 more rows

Pool 2

Skip if not used

List samples that fail any of the QC checks

Re-prep if <50% of amplicons have >100 reads

Re-pool if >50% but <75% of amplicons have >100 reads
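
The two rules above can be written as a small decision function. The handling of the exact boundaries (50% and 75%) is an assumption, since the rules do not say which side they fall on:

```python
def qc_action(frac_above_100):
    """Map the fraction of amplicons with > 100 reads to a follow-up action."""
    if frac_above_100 < 0.50:
        return "re-prep"
    if frac_above_100 < 0.75:
        return "re-pool"    # exactly 0.50 lands here by assumption
    return "pass"           # exactly 0.75 lands here by assumption

for frac in (0.112, 0.60, 0.90):
    print(frac, qc_action(frac))
```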

## Joining with `by = join_by(sampleID)`

Hypothetical filtering

Remove bad samples

Remove bad loci

Remove controls

## [1] "Sample size= 151"
## [1] "Number of entries excluded = 0, Proportion excluded = 0"
## [1] "Number of loci after cleanup = 242"
## character(0)
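
The three removal steps can be sketched as a single filter over a long-format table. This is an illustration only, with hypothetical sample and locus names; it does not reproduce the counts reported above:

```python
def apply_filters(rows, bad_samples, bad_loci, controls):
    """Drop rows from failed samples, failed loci and control wells."""
    excluded_samples = set(bad_samples) | set(controls)
    return [
        r for r in rows
        if r["sampleID"] not in excluded_samples and r["locus"] not in bad_loci
    ]

# Hypothetical rows
rows = [
    {"sampleID": "S1", "locus": "locusA", "reads": 300},
    {"sampleID": "S1", "locus": "locusB", "reads": 5},
    {"sampleID": "S2", "locus": "locusA", "reads": 12},
    {"sampleID": "NEG1", "locus": "locusA", "reads": 3},
]
kept = apply_filters(rows, bad_samples={"S2"}, bad_loci={"locusB"}, controls={"NEG1"})
print("Sample size =", len({r["sampleID"] for r in kept}))
print("Number of loci after cleanup =", len({r["locus"] for r in kept}))
```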